Skip to content

fix(tools/talis): fibre-experiment race fixes & loadgen tooling#3305

Merged
julienrbrt merged 3 commits intoevstack:julien/fiberfrom
walldiss:pr1-talis-fibre-fixes
May 3, 2026
Merged

fix(tools/talis): fibre-experiment race fixes & loadgen tooling#3305
julienrbrt merged 3 commits intoevstack:julien/fiberfrom
walldiss:pr1-talis-fibre-fixes

Conversation

@walldiss
Copy link
Copy Markdown

@walldiss walldiss commented May 2, 2026

Issues

Operating the talis-driven Fibre throughput experiment surfaced three classes of bug that consistently broke fresh cluster bring-up and made one set of test runs unrepeatable:

  1. Bootstrap races against an unready chain. set-host and deposit-to-escrow were one-shot calls — --yes returns the txhash before block inclusion, so the operator saw "success" while the tx silently bounced (celestia-app is not ready; please wait for first block). All subsequent steps then ran against an unregistered host with no escrow.
  2. Non-atomic keyring scp on bootstrap-evnode. scp -r creates keyring-test/ on the target before any of its files transfer; the evnode init script's [ -d keyring-test ] poll then triggered mid-transfer and evnode launched with a partial keyring (fibre-0.info: key not found).
  3. talis up lied about partial failures. When some EC2 instances failed to launch, CreateAWSInstances returned nil error and talis up printed "deployment complete" against a partial cluster, breaking later steps with confusing errors.

A few smaller bugs also fixed here:

  • fibre_setup.go queried celestia-appd query fibre escrow (which doesn't exist — it's escrow-account), so the verification loop wedged for 3 minutes regardless of actual state.
  • sshExec used cmd.CombinedOutput, mixing SSH Warning: Permanently added chatter into bytes the CLI handed to Sscanf("%d") — the providers cross-check parsed those warnings as 0 and looped to its 5-minute deadline even when SSH-direct returned 3.
  • evnode-txsim used the stdlib default MaxIdleConnsPerHost=2, so any --concurrency > 2 opened fresh TCP/TLS sockets per request and turned the loadgen itself into the binding constraint.

Solution

  • fibre_setup.go: chain-ready preamble, retry-until-confirmed loops for set-host and fibre-0 deposit-to-escrow, switched verification query to escrow-account and key on "found":true, CLI cross-check that all validators registered.
  • fibre_bootstrap_evnode.go: stage keyring under /root/.keyring-fibre.staging/ then mv into place atomically.
  • New fibre_experiment.go umbrella command driving up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode so the operator can't skip a verification step.
  • aws.go: CreateAWSInstances now returns a joined error on any partial failure.
  • download.go: sshExec returns stdout only via cmd.Output() and adds -q -o LogLevel=ERROR to silence SSH warning chatter.
  • evnode-txsim: bump per-host idle conns to 2 × concurrency; always-on pprof on 127.0.0.1:6060 for in-process profiling under live load.

Test plan

  • talis fibre-experiment runs unattended end-to-end against a fresh cluster
  • talis CLI returns non-zero when talis up partial-fails
  • evnode-txsim saturates with --concurrency 32 (verified via pprof — request hot path is now write(2)/body copy, not TCP handshake)

walldiss added 3 commits May 3, 2026 01:41
Three race conditions surfaced repeatedly on a fresh AWS bring-up of
the Fibre throughput experiment. Each one had the same shape: a
talis subcommand "succeeded" at the CLI level (or returned the txhash
with --yes) before the chain had actually applied the work, leaving
downstream steps to fail in confusing ways. This commit makes each
step verify *outcome*, not just *invocation*, so the experiment can
go from a fresh `talis up` to a running loadgen without manual
intervention.

  • setup-fibre script (fibre_setup.go) now:
    - polls `celestia-appd status` for `latest_block_height>0`
      before submitting any tx — fixes the silent-noop where
      set-host + 100× deposit-to-escrow all bounced with
      "celestia-app is not ready; please wait for first block";
    - retries `set-host` in a loop until the validator's host
      shows up in `query valaddr providers` — fixes the case
      where --yes returns the txhash before block inclusion and
      the tx silently lands in the mempool but never confirms;
    - verifies fibre-0's escrow account is funded on-chain before
      the tmux session exits — same silent-failure mode as
      set-host, but on the deposit side.
    The talis-CLI step also now cross-checks all validators are
    registered from a single vantage point before returning, so a
    concurrent set-host race surfaces as an error instead of a
    half-empty provider list start-fibre would cache forever.

  • fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
    the keyring scp into a tmp directory and `mv`s it atomically
    into place. The previous direct `scp -r` to
    /root/keyring-fibre/keyring-test created the directory before
    transferring its contents — the evnode init script's
    `[ -d keyring-test ]` poll passed mid-transfer, the daemon
    launched with no fibre-0.info, and crashed with `keyring entry
    "fibre-0" not found`.

  • evnode_init.sh (genesis.go) now waits for the specific
    keyring-test/fibre-0.info file rather than just the
    keyring-test directory. Belt-and-braces: the bootstrap mv is
    already atomic on the same filesystem, but the file-level
    guard means a hand-pushed keyring (not via talis) can't trip
    the same race.

  • New `talis fibre-experiment` umbrella command runs
    up → genesis → deploy → setup-fibre → start-fibre →
    fibre-bootstrap-evnode in order. Each step uses the same
    binary as a subprocess; failures in any step abort the chain.
    Operator goes from a prepared root dir to a running loadgen
    with one command, instead of remembering the sequence.

Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR evstack#3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR evstack#3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).
Three follow-up bugs surfaced from the PR evstack#3303 follow-up
verification run on a 3-validator AWS Fibre cluster:

- aws.go: CreateAWSInstances exited 0 even when individual
  instance launches failed, so `talis up` lied about success
  and downstream steps proceeded against a partial cluster.
  Returns a joined error now so failure cascades stop early.

- download.go: sshExec used cmd.CombinedOutput, mixing SSH
  warnings (the "Warning: Permanently added '...'..." chatter
  on stderr) into bytes the caller hands to fmt.Sscanf("%d").
  The CLI-side providers cross-check parsed those warnings
  as 0 and looped until its 5-min deadline even though a
  direct SSH query showed all 3 providers registered. Switch
  to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR`
  to silence the chatter for any caller that does combine
  streams.

- fibre_setup.go: the per-validator escrow verification used
  `celestia-appd query fibre escrow` which doesn't exist —
  the actual subcommand is `escrow-account`. The query
  errored on every retry, the grep for "amount" never
  matched, and the script wedged on the 3-min deadline
  reporting `FATAL: fibre-0 escrow not present`. Switch to
  `escrow-account` and key on `"found":true` (the explicit
  existence flag in the response). Also wrap the fibre-0
  deposit-to-escrow itself in a retry loop matching set-host
  — same `--yes`-returns-before-inclusion silent-failure
  mode bit it. fibre-1..N stay best-effort.
Two diagnostic improvements for the load generator:

1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
   With --concurrency=8 (or higher), 6+ goroutines per cycle had
   to open fresh TCP+TLS sockets per request because the pool
   couldn't hold their idle conns between requests. Bump
   MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
   2*concurrency so every active sender has a reusable keep-alive
   socket, eliminating handshake churn from the hot path.

2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
   load tester, not a production daemon, so cost of always serving
   profiling is acceptable; the payoff is being able to grab CPU
   profiles under live load without re-deploying the binary —
   `ssh -L 6060:127.0.0.1:6060 root@loadgen \
     go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.

A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic surfaced that the c6in.2xlarge loadgen was the
binding constraint for the experiment at ~22 MB/s, not evnode or
DA — a finding we'd have spent another debug round chasing
without the in-process profiler.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 2, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c97e7c08-2a60-4f33-b0a2-bbf95f1f8fcd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: Turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get your free trial and get 200 agent minutes per Slack user (a $50 value).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@julienrbrt julienrbrt merged commit 513ce9b into evstack:julien/fiber May 3, 2026
17 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants